A Unifying Approach to HTML
Authors
Abstract
The number, size, and dynamics of Internet information sources bear abundant evidence of the need for automation in information extraction. This calls for representation formalisms that match the reality of the World Wide Web, and for learning approaches and learnability results that apply to these formalisms. The concept of elementary formal systems is generalized to allow for the representation of wrapper classes that are relevant to the description of Internet sources in HTML format. Related learning results prove that those wrappers are automatically learnable from examples. This sets the stage for information extraction from the Internet by exploiting inductive learning techniques.

1 Motivation

Today's online access to millions or even billions of documents on the World Wide Web is a great challenge to research areas related to knowledge discovery and information extraction (IE). The general task of IE is to locate specific pieces of text in a natural language document. The authors' approach takes advantage of the fact that all documents prepared for the Internet in HTML, in XML, or in any other possibly forthcoming syntax have to be interpreted by browsers sitting anywhere on the World Wide Web. For this purpose, the documents need to contain syntactic expressions that control their interpretation, including their visual appearance and interactive behaviour. In HTML, these are the text formatting and annotating strings (tags); in LaTeX, for instance, there are numerous commands. The ...
Similar resources
The Origin and Limitations of Modern Mathematical Economics: A Historical Approach
We have first demonstrated that Debreu’s view regarding the publication of The Theory of Games and Economic Behavior by von Neumann and Morgenstern in 1944 as the birth of modern mathematical economics is not convincing. In this paper, we have proposed the hypothesis that the coordinated research programs in the 1930’s, initiated by the Econometric Society and the Cowles Commission for Research...
Definition of Architecture
This paper seeks to investigate a new definition for architecture by unifying the three Vitruvian principles of firmitas, utilitas, and venustas via a phenomenological approach to interpreting and analyzing their role in defining architecture. The paper is composed of two main sections. The first section investigates the nature of architecture based on the mentioned principles, where a...
Unifying Textual and Visual Cues for Content-Based Image Retrieval on the World Wide Web
A system is proposed that combines textual and visual statistics in a single index vector for content-based search of a WWW image database. Textual statistics are captured in vector form using latent semantic indexing based on text in the containing HTML document. Visual statistics are captured in vector form using color and orientation histograms. By using an integrated approach, it becomes po...
Fixed Point Results on $b$-Metric Space via Picard Sequences and $b$-Simulation Functions
In a recent paper, Khojasteh et al. [F. Khojasteh, S. Shukla, S. Radenović, A new approach to the study of fixed point theorems via simulation functions, Filomat, 29 (2015), 1189–1194] presented a new class of simulation functions, say $\mathcal{Z}$-contractions, with unifying power over known contractive conditions in the literature. Following this line of research, we extend and ...